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Abstract 

Machine learning and geostatistics are powerful mathematical frameworks for modeling spatial 
data. Both approaches, however, suffer from poor scaling of the required computational resources 
for large data applications. We present the Stochastic Local Interaction (SLI) model, which em¬ 
ploys a local representation to improve computational efficiency. SLI combines geostatistics and 
machine learning with ideas from statistical physics and computational geometry. It is based on a 
joint probability density function defined by an energy functional which involves local interactions 
implemented by means of kernel functions with adaptive local kernel bandwidths. SLI is expressed 
in terms of an explicit, typically sparse, precision (inverse covariance) matrix. This representation 
leads to a semi-analytical expression for interpolation (prediction), which is valid in any number 
of dimensions and avoids the computationally costly covariance matrix inversion. 
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I. INTRODUCTION 


Big data is expected to have a large impact in the geosciences given the abundance 
of remote sensing and earth-based observations related to climate in 121 ■ A similar data 
explosion is happening in other scientific and engineering fields [41] • This trend underscores 
the need for algorithms that can handle large data sets. Most current methods of data 
analysis, however, have not been designed with size as a primary consideration. This has 
inspired statements such as: “Improvements in data-processing capabilities are essential 
to make maximal use of state-of-the-art experimental facilities” [3j- Machine learning can 
extract information and “learn” characteristic patterns in the data. Thus, it is expected to 
play a significant role in the era of big data research. The application of machine learning 
methods in spatial data analysis has been spearheaded by Kanevski [2S]. Machine learning 
and geostatistics are powerful frameworks for spatial data processing. A comparison of their 
performance using a set of radiological measurements is presented in p2j. The question 
that we address in this work is whether we can combine ideas from both fields to develop a 
computationally efficient framework for spatial data modeling. 

Most data processing and visualization methods assume complete data sets, whereas in 
practice data often have gaps. Hence, it is necessary to fill missing values by means of 
imputation or interpolation methods. In geostatistics, such methods are based on various 
flavors of stochastic optimal linear estimation (kriging) [7J. In machine learning, methods 
such as fc-nearest neighbors, artificial neural networks, and the Bayesian framework of Gaus¬ 
sian processes are used [3Tj . Both geostatistics and Gaussian process regression are based 
on the theory of random fields and share considerable similarities mm- The Gaussian 
process framework, however, is better suited for applications in higher than two dimensions. 
A significant drawback of most existing methods for interpolation and simulation of missing 
data is their poor scalability with the data size N, i.e., the 0(N 3 ) algorithmic complexity 
and the 0(N 2 ) memory requirements: An 0(N P ) dependence implies that the respective 
computational resource (time or memory) increases with N as a polynomial of degree at 
most equal to p. 

Improved scaling with data size can be achieved by means of local approximations, dimen¬ 
sionality reduction techniques, and parallel algorithms. A recent review of available methods 
for large data geostatistical applications is given in [35]. These approaches employ clever 
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approximations to reduce the computational complexity of the standard geostatistical frame¬ 
work. Local approximations involve methods such as maximum composite likelihood ^U] 
and maximum pseudo-likelihood [38]. Another approach involves covariance tapering which 
neglects correlations outside a specified range dnuiniES]. Dimensionality reduction includes 
methods such as fixed rank kriging which models the precision matrix by means of a fixed 
rank matrix r ^ N 0130]. Markov random fields (MRFs) also take advantage of locality 
using factorizable joint densities. The application of MRFs in spatial data analysis was ini¬ 
tially limited to structured grids [32]. However, a recently developed link between Gaussian 
random fields and MRFs via stochastic partial differential equations (SPDE) has extended 
the scope of MRFs to scattered data [28]. 

We propose a Stochastic Local Interaction (SLI) model for spatially correlated data which 
is based by construction on local correlations. SLI can be used for the interpolation and 
simulation of incomplete data in d -dimensional spaces, where d could be larger than 3. 
The SLI model incorporates concepts from statistical physics, computational geometry, and 
machine learning. We use the idea of local interactions from statistical physics to impose 
correlations between “neighboring” locations by means of an explicit precision matrix. The 
local geometry of the sampling network plays an important role in the expression of the 
interactions, since it determines the size of local neighborhoods. On regular grids, the SLI 
model becomes equivalent to a Gaussian MRF with specific structure. For scattered data, 
the SLI model provides an alternative to the SPDE approach that avoids the preprocessing 
cost involved in the latter. 

The SLI model extends previous research on Spartan spatial random fields [13, 21], [22] to 
an explicitly discrete formulation and thus enables its application to scattered data without 
the approximations used in [23]. SLI is based on a joint probability density function (pdf) 
determined from local interactions. This is achieved by handling the irregularity of sampling 
locations in terms of kernel functions with locally adaptive bandwidth. Kernel methods are 
common in statistical machine learning [39] and in spatial statistics for the estimation of 
the variogram and the covariance function mmm- 

The remainder of the article is structured as follows: Section [TT] briefly introduces useful 
definitions and terminology. In Section |III| we construct the SLI model, propose a compu¬ 
tationally efficient parameter estimation approach, and formulate an explicit interpolation 


expression. In Section IV we investigate SLI interpolation using different types of simulated 
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and real data in one, two and four dimensional Euclidean spaces. Section [V] discusses poten¬ 
tial extensions of the current SLI version and connections with machine learning. Finally, 


in Section VI we present our conclusions and point to future research. 


II. BACKGROUND CONCEPTS AND NOTATION 

A. Definition of the problem to be learned 

a. Sampling grid The set of sampling points is denoted by Sn = {sx,... ,s N }, where 
Sj, % — 1,..., N are vectors in the Euclidean space or in some abstract feature space that 
possesses a distance metric. In Euclidean spaces, the domain boundary is defined by the 
convex hull, 'H(Sn), of Sn- 

b. Sample and predictions The sample data are denoted by the vector xg = {x \,..., xn) t , 
where the superscript “T” denotes the transpose. Interpolation aims to derive estimates of 
the observed field at the nodes of a regular grid Q C Tf, or at validation set points which 
may be scattered. The estimates (predictions) will be denoted by £(s p ), p — 1,..., P, i.e., 
x P = (£ 1 ,.. .,x P ) T . 

c. Spatial random field model The data xs are assumed to represent samples from a 
spatial random field (SRF) Xi(oj), where the index i — 1,..., N denotes the spatial location 
Sj e Sn- The expectation over the ensemble of probable states is denoted by E[Xj(o;)], and 
the autocovariance function is given by Cjj := E[Xj(o;) Xjipj)] — E[Xj(o;)] E[X,-(o;)]. 

The pdf of Gibbs SRFs can be expressed in terms of an energy functional iJ(x§; 0), where 
6 is a set of model parameters , according to the Gibbs pdf [321 P- 51] 

-tf(x s;0) 

/x(x s ; 6) = ~z(jd) ' 

The constant Z{9), called the partition function, is the pdf normalization factor obtained 
by integrating over a p the probable states xg. 

B. From continuum spaces to scattered data 

The formulation based on ([Tj) has its origins in statistical physics, and it has found 
applications in pattern analysis IS EH! and Bayesian field theory, e.g. |Hl 127]. In statistics, 


4 




TABLE I: Definitions of kernel functions used in Section UV] below. The first three have 
compact support. Notation: u = ||r||//i where ||r|| is the distance and h the bandwidth; 
1 a(u) is the indicator function of the set A, i.e., 1 a(u) = 1, u E A and 1 a(u) = 0, u £ A. 


Triangular K(u) — (1 — u) 1| u |<i(m) 
Tricube K{u) = (1 — u 3 ) 3 l| u |<i(^) 
Quadratic K{u) = (1 — u 2 ) l| u |<i(w) 
Gaussian K (u) = exp(— u 2 ) 
Exponential K(u) = exp(—|u|) 


this general model belongs to the exponential family of distributions that have desirable 
mathematical properties [5]. Our group used the exponential density in connection with a 
specific energy functional to develop Spartan spatial random fields (SSRF’s) [ 131 12T1 SHI . In 
Section [Hi] we construct an explicitly discrete model motivated by SSRFs which adapts local 
interactions to general sampling networks and prediction grids by means of kernel functions. 


C. Kernel weights 


Let K (r) be a non-negative-valued kernel that is either compactly supported or decays 
exponentially fast at large distances (e.g., the Gaussian or exponential function). We define 
kernel weights associated with the sampling points s* and s j as follows 


Kid = K 




( 2 ) 


where ||sj — Sj|| is the distance (Euclidean or other) between two points s* and s v whereas 
hi is the respective kernel bandwidth that adapts to local variations of the sampling pattern. 
The kernel weight K 3 l is defined in terms of a bandwidth hj. Hence, TQj ^ Kj^ if the 
bandwidths hi and hj are different. Examples of kernel functions are given in Table |T| 

Let (A/v) denote the distance between s* and its k- nearest neighbor in Sn (k = 0 
corresponds to zero distance). We choose the local bandwidth associated with s, : according 
to 


hi fr (S)y), (3) 

where ji > 1 and k > 1 are model parameters. In several case studies involving Euclidean 
spaces of dimension d = 1,2, 3,4, we determined that k — 2 (second nearest neighbors) 
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performs well for compactly supported kernels and k = 1 (nearest neighbors) for infinitely 
supported kernels. Using k — 2 for compact kernels avoids zero bandwidth problems which 
result from k = 1 for collocated sampling and prediction points. Since the sampling point 
configuration is fixed, /i and A;,[fc](*S/v) determine the local bandwidths. A,[fc](SW) depends 
purely on the sampling point configuration, but fi also depends on the sample values. For 
compactly supported kernels setting k = 1 only makes sense if /i > 1; otherwise hi = 
Di ^ = i](S]sr) implying that the kernel vanishes even for the nearest-neighbor pairs and thus 
fails to implement interactions. 


D. Kernel averages 


For any two-point function $(■), we use a local-bandwidth extension of the Nadaraya- 
Watson kernel-weighted average over the network of sampling points [291111] 


<*(0>h = 


Eili Ef=i Kij $(•) 

EixEjlitfy 


where h = (hi, ..., h^) T is the vector of local bandwidths. The function <&(•) represents the 
distance between two points ||s* — Sj|| or the difference Xi — Xj of the held values, or any 
other function that depends on the locations or the values of the held. The kernel average 
is normalized so as to preserve unity, i.e., (l)h = 1 for all possible point configurations. 


III. THE STOCHASTIC LOCAL INTERACTION (SLI) MODEL 


The SLI joint pdf is determined below by means of the energy functional (|4]). This leads 
to a precision matrix which is explicitly defined in terms of local interactions and thus avoids 
the covariance matrix inversion. The prediction of missing data is based on maximizing the 
joint pdf of the data and the predictand, which is equivalent to minimizing the corresponding 


energy functional. This leads to the mode predictor (21), which involves a calculation with 
linear algorithmic complexity. 
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A. The energy functional 


Consider a sample x s on an unstructured sampling grid with sample mean /i X • We 
propose the following energy functional i/ x (xsi 0) 


#x(x s ; 0 ) = dV [ 5 o(x s ) + « 1 5i(x s ; hi) + a 2 S 2 (x s ; h 2 )] , (4) 

where 6 = (/i x , aq, a 2 , A, /i, fc) is the SL1 parameter vector and the parameters / j,,k are 
defined in Section Hi Cl above. 

The terms S 0 (x s ), <Si(x s ;hi), and 5 2 (xg;h 2 ) correspond to the averages of the square 
fluctuations, the square gradient and the square curvature in a Euclidean space of dimension 
d. The latter two are given by kernel-weighted averages that involve the field increments 

X . rp . rp . 


s 0 (x s ) = 

i= 1 

*5i(x s ; hi) = d(xl j ) hl , 


(5) 

( 6 ) 


5 2 (x s ; h 2 ) = c 2j i (xlj) h2 - c 2 , 2 (x?j) hs - c 2)3 (x-j) h4 , (7a) 

where c 2 i = 4 d(d + 2), c 2i2 = 2 d(d — 1), c 2j3 = d. (7b) 


The c 2 j (j = 1,2,3) values in S 2 are motivated by discrete approximations of the square 
gradient and curvature [22] • We use two vector bandwidths, hi and h 2 , to determine 
the range of influence of the kernel function around each sampling point for the gradient 


d>i(x n ; hi) and curvature d> 2 (x n ; h 2 ) terms respectively. Additional bandwidths used in (7a) 
for S 2 (x n ; h 2 ) are defined by h 3 = v / 2h 2 , hi = 2h 2 . These definitions are motivated by the 
formulation of SSRFs [21. [23]. 
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B. SLI parameters and permissibility 


To obtain realistic kernel bandwidths, k should be a positive integer larger than one, 
and p should be larger than one. The parameter fix is set equal to the sample mean. The 
coefficients aii,a 2 control the relative contributions of the mean square gradient and mean 
square curvature terms. The coefficient A controls the overall amplitude of the fluctuations. 
Finally, fi and k control the bandwidth values as described in Section II C| 

The SLI energy functional (Jfj) is permissible if //xfxy 6 ) > 0 for all xs, a condition which 
ensures that e _ ^ xs;0 ) is bounded and thus the existence of the partition function in ([Tj). 
Assuming that S 2 > 0 (<Sq and <Si are always non-negative by construction), a sufficient 
permissibility condition, independently of the distance metric used, is ai,a 2 ,X > 0. In all 
the case studies that we have investigated, however, we have not encountered permissibility 
problems so long as aq, a 2 , A > 0. Intuitively, the justification for the permissibility of Q is 
that the first average, i.e., (xjj) h 2 in (7a) has a positive sign and is multiplied by C 24 , which 
is significantly larger (especially as d increases) than the coefficients 02,2 and c 2 ^ multiplying 
the negative-sign averages (xfj) h 3 and (xf 3 ) h 4 . This property is valid for geodesic distances 
on the globe and for other metric spaces as well. 


C. Precision matrix representation 

We express Q in terms of the precision matrix Jij(O) (i,j = 1,..., N) 

iL x (x s ;0) = ^(x s - a* x ) T J(0) (x s - At x ). (8) 

The symmetric precision matrix J(0) follows from expanding the squared differences 
in ([4]), leading to the following expression 


J(0) — | -jy + ai d Ji(hi) + a 2 [c 2 .i ^(fkz) — 02,2 Js(h3) — 02,3 J 4 (h 4 )] |>, (9) 

where Iat is the N x N identity matrix: [Ijv]*,j = 1 if i — j and [Ijv]»,j = 0 otherwise, and 
J 9 (h 9 ), q = 1, 2, 3,4 are network matrices that are determined by the sampling pattern, the 
kernel function, and the bandwidths. The index q defines the gradient network matrix for 






q = 1, whereas the values q — 2, 3,4 specify the curvature network matrices that correspond 


to the three terms in ^(xg; h 2 ) given by (7a). The elements of the network matrices J g (h g ) 
are given by the following equations 


V*i,j {hq\i) 


N 

{hq:i) T [bv] i,j ^ ^ T 5 

Z=1 



q = 1, • • • , 4. 


(10a) 

(10b) 


The network matrices defined by (10) are symmetric by construction. It follows from (10) 
that the row and column sums vanish, i.e., 


N 


— 0 . 

3 =1 


( 11 ) 


Based on (10a), the diagonal elements are given by the following expression 


N 

[Jg(hg)]jj ^ ~' j ['U'iJ (hq\i) T W; ! j(/lq^)] . (12) 

Since the kernel weights are non-negative, it follows that the sub-matrices J 9 (h 9 ) are di¬ 
agonally dominant , i.e., | [Jq(h g )] iii | > | [Jg(h<j)]ij|- It also follows from (J9| and 0 

that 


N 




1 

iVA' 


(13a) 


D. Parameter inference 

We have experimented both with maximum likelihood estimation and leave-one-out cross 
validation. The former requires the calculation of the SLI partition function, which is an 
0(N 3 ) operation for scattered data. For large data sets the 0(N 3 ) complexity is a com¬ 
putational bottleneck. Parameter inference by optimization of a cross validation metric is 
computationally more efficient, since it is at worst an 0(N 2 ) operation as we show below. 
The memory requirements for storing the precision matrix are 0(N 2 ) but can be signih- 
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cantly reduced by using sparse matrix structures. Let 0 _ a = (an, «2, /x, px) T represent the 
parameter vector excluding A. We use the following cross validation cost functional 


N 


$(x s ; 0 -a) = _ 


Xi 


(14) 


2=1 


where Xi(6^\) is the SLI prediction at s* based on the reduced sampling set Sn — {s*} using 
the parameter vector 0 _a which applies to all i — 1,..., N. The prediction is based on the 


interpolation equation (22) below and does not involve A (see discussion in Section VB). 
The optimal parameter vector excluding A, i.e., is determined by minimizing the 


cost functional (14): 


6*_ x = arg min $(xg; 0 


0 -> 


(15) 


If f/(xg; 0 a) is the energy estimated from ([8]) and ([9]) by setting A = 1, the optimal value 
A* is obtained by minimizing the negative log-likelihood with respect to A leading to the 
following solution (see 0 


A* = 


2j/(x s ;0- 

N 


(16) 


We determine the minimum of the cross validation cost functional (14) using the MATLAB 
constrained optimization function fmincon with the interior-point algorithm [43]. This 


function determines the local optimum nearest to the initial parameter vector. We use initial 
guesses for the parameters ai,a 2 ,/L and we assume that the parameters are constrained 
between the lower bounds [0.5, 0.5, 0.5] and the upper bounds [300, 300,15]. We investigated 
different initial guesses for the parameters which led to different local optima. We found, 
however, that the value of the cross validation function is not very sensitive on the local 


optimum. In the 4 D case study presented in Section [IV B[ we also estimate for comparison 
purposes the global optimum using Matlab’s global optimization tools. 


E. Predictive SLI model 


Let us now assume that the prediction point s p is added to the sampling points. To predict 
the unknown value of the field at s p , we insert this point in the energy functional (J8]) , which 
is then given by Eq. (19) below. Then, we determine the mode of the joint pdf ([Tj) with 
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the prediction point inserted in the energy functional. Thus, we obtain a mode prediction 
equation for x p given by (22) below. 


1. Modification of kernel weights 


Upon inclusion of s p , the weights (10b) of the network matrices (10a) are modified as 
follows 


u 


■,j {hq;i) 


I< 


E« K 


+ E, k 


Sj — Sp 
hn.i 


_l - _ k I Si ~ s p 


(17) 


where JT ■ := YljLv The hrst term in the denominator concerns interactions between 
sampling points. The second term involves local interactions between the sampling points 
and the prediction point which result from inserting the prediction point in the local neigh¬ 
borhoods of the sampling points, which control the bandwidths. Finally, the third term 
also involves interactions between the prediction point and the sampling points, but in this 
case the bandwidth is controlled by the former. Fig. [l] below illustrates the difference be¬ 
tween the second and third term in the context of the entire precision matrix. The index 
q is used to distinguish between the weights linked to the gradient (q — 1) and the three 
weights (q = 2,3,4) linked to the curvature terms. The only difference between weights 
with different q is the bandwidth. In the case of compactly supported kernels, different 
bandwidths imply that different numbers of pairs are involved in the summations, since a 
pair separated by a distance that exceeds the bandwidth does not contribute. Calculation of 
the predictand contributions in the denominator of © is an operation with computational 
complexity O(N) compared to 0(N 2 ) for the interactions between sampling points. The 
latter term, however, is calculated once and used for all the prediction points. 

In addition to the weights that correspond to pairs of sampling points, there are weights 
for combinations of sampling and prediction points, i.e., 


'ttp,j (hq:p) 


K 


Eyff 


+ E, K 


Sj — Sp 


+ E. K 


Si Sp 
ha.n 


( 18 ) 


where p — 1,..., P, j — 1,..., N. The denominator of (18) is identical to that of (pT|). 


11 
















2. SLI mode predictor 


Using the precision matrix formulation, the energy functional including the predictand is 
given by 


Hx(xs,x P ',0*) =#x(x s ;0*) + J PtP {0*)(xp 


N 

hx) 2 + “ Ux) "M 0 *) ( X P 

i =1 


N 

“b ^ ^ ( J'p /ix) Jp,i(0 ) ['E'l h'x)- 

j=l 


hx) 


(19) 


The elements of the precision matrix that involve the prediction point are 


N 


[Jg(hg)]p,p 'y ^ [ u i,p(hq;i) + u p,i{hq;p)] i 


i =1 


[Jq(hg)]jp |hUp(hg;j) T Upi{lljq-p)\ , i 7^ P- 


(20a) 

(20b) 


Based on (20a) the symmetry property J Pji (0*) = J i)P {0*) follows. The coefficients 
Ui.p{h (fl ) and u p ,i{h qvp ) differ due to the different bandwidths used (in the former, the band¬ 
width is determined by the neighborhood of the sampling point s*, whereas in the latter by 


the neighborhood of s p .) A schematic illustration of terms in (19) that involve the predic¬ 
tand is given in Fig. [lj The left diagram corresponds to terms “rooted” at s p (i.e., with 
coefficient u pp {h q . p ) that involves the bandwidth h p ), whereas the right hand side diagram 
corresponds to terms rooted at the sampling points, i.e., with coefficients u i}P (h q -i). 


The SLI mode predictor is defined by the following equation 


x p = arg min H x (x Sj x p , 0*), 


( 21 ) 


where f/x(xs, x p ] 0*) is given by (19). Minimization of the energy with respect to x p leads 
to the following mode estimator 
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(a) Bandwidth determined by s p . 


4 



(b) Bandwidth determined by s r 


FIG. 1: Schematic diagrams of terms contributing to (19) that include the prediction point 


s p (center point) and six sampling points s n (n — 1,..., 6). The diagram on the left (a) 
represents terms J P: i whereas the diagram on the right (b) represents terms Jj jP . The point 
at the “root” of each arrow determines the bandwidth for the weight that involves the two 

points connected by the arrow. 


sib fo - ax) 

2 J P , P {0*) 

E£Li J pA g *) (Xj - fix) 

j P , P m 




= /ix 


( 22 ) 


where the precision matrix elements are given by (10a) using the modified kernel weights (17) 


and (18). 


The SLI mode predictor can be generalized to P prediction points as follows 


x P = Aix - Jp,s(0*) (x - Mx), 


(23a) 


where Jp,s(0*) is a P x N matrix given by 


[Jp,s(0*)ki = JpAo*)/J P A d *)- 


(23b) 
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3. Properties of SLI predictor 


The SLI prediction (23) is unbiased in view of the vanishing row sum property (13a) 


satisfied by the network matrices and the precision matrix. The SLI prediction (23) is 


independent of the parameter A which sets the amplitude of the fluctuations, because the 
transfer matrix Jp,s(0*) is given by the ratio of precision matrix elements. This property 
is analogous to the independence of the kriging predictor from the random field variance. 
Hence, leave-one-out cross validation does not determine the optimal value of A, which is 


obtained from (16). 


The SLI predictor is not necessarily an exact interpolator. In particular, let us consider 
a point Sfc, k e {1,... ,1V}, which is very close to s p . Based on (20) and ([22]) , x p —> x k 


as s p -A s fc only if (i) \u k , p (h p )\ > \u i:P (h p )\ and (ii) \u ktP (h k )\ > \u i:P (hi)\ for all i ^ k. 
Condition (i) materializes only for compactly supported kernels if h p —> 0 which requires 
that the bandwidth be determined by the nearest neighbor distance. Condition (ii), on 
the other hand, requires that ||— s p \\/h k -C ||sj — s p \\/hi for i ^ k. This condition holds 
approximately at best if the sample is sparse around s p . 

The computational complexity of the SLI predictor is 0(N 2 + P N). The 0(N 2 ) term is 


due to the double summation over the sampling points in (17), which needs to be calculated 


only once. The remaining operations per each prediction point scale linearly with the sample 
size, hence the 0(P N) dependence. Based on the above, the dominant term (for fixed P) in 
the computational time scales as 0(N 2 ). In future work we will investigate approximating 


the double summation in the denominator of (10b) and (18) with analytically evaluated 


double integrals over the kernel functions to increase the computational efficiency. 


IV. CASE STUDIES 


We first consider two synthetic data sets, the first consisting of a time series and the 
second of a four-dimensional test function. We then investigate a set of scattered real data 
in two spatial dimensions. 
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A. Time series with Matern covariance function 


We generate a time series of length N = 300 from a random process with Matern co- 
variance C{t) = a 2 2 1 ~ u K v {t//Y) v /Y{v), where K v (-) is the modified Bessel function 
of order u, T(-) is the gamma function, a = 10, v = 3.5 is the smoothness index, and 
£ = 10 is the correlation time. We use 60 randomly selected points as the training set 
(corresponding to an 80% degree of thinning) and the remaining 240 points as the valida¬ 
tion set. The SL1 optimal parameters using a quadratic kernel and k — 2 are given by 
ctq ~ 29.30, «2 ~ 191.02,/i « 1.11, A ~ 297.84. The sparseness of the precision matrix is 


evident in Fig. [2aJ The darkest areas correspond to negative infinity and reflect distances 
for which the precision matrix vanishes. 

The prediction performance is illustrated in the scatter plot of the SL1 predictions versus 


the respective validation set values shown in Fig. 2b The Pearson correlation coefficient 
between the validation values and the predictions is 0.89. The splitting of the time series into 
training and validation sets is shown in Fig. [3] along with the SLI predictions and associated 
error bars. The SLI predictions capture well general features of the time series. However, 
in areas of low sampling density the SLI predictions smoothes excessively the fluctuations 
in the original series. The SLI performance is excellent for the same degree of thinning, if 
the length of the initial time series increases to 3 000. On the other hand, the prediction 
accuracy deteriorates for rougher random processes, such as a non-differentiable Matern 
process with v = 0.8. 


B. Four-dimensional deterministic test function 

We consider the function x(s) 

4 

x(s) = Ae“ 2||s ' a|! JJ Si (1 - s^, (24) 

1=1 

where A = 500 and a = (0.3, 0.3, 0.3,0.3), defined over the four-dimensional cube with 
unit length edges, i.e., for s 6 [0, l] 4 . We sample the function at N — 1000 randomly 
selected points over the unit cube, and we generate a validation set of N — 1000 points 
also by random selection. The SLI optimal parameters for the quadratic kernel with k = 2 
are given by aq ~ 10.12, a 2 ~ 25.04, /i ps 1.64, A ~ 0.0193 starting with initial values 
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Precision Matrix 
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(b) Predictions vs validation values 


(a) In \Jij\ 

FIG. 2: Analysis of time series with Matern correlations (er = 10, v = 3.5, £ = 10). (a) 
Logarithm of absolute value of the precision matrix; dark areas (blue online) correspond to 
low values whereas lighter areas (green to red online) correspond to higher values, (b) 
Scatter plot of SLI predictions versus the respective values of the validation set. 


ot\ = 10,a2 = 25, fi = 3. Similar results in terms of cross validation performance are 
also obtained with different initial conditions that lead to different local optima. The cross 
validation measures for the parameters above are given by ME= 0.0046, MAE= 0.0320, 
RMSE= 0.0459, r— 0.96. The sparse structure of the precision matrix is illustrated in 


Fig. 4a which displays the logarithm of the absolute value. The scatter plot of the validation 


values versus the respective SLI predictions is shown in Fig. Tb and demonstrates very good 
agreement at most points. 

We repeat the experiment by adding Gaussian noise to the sample. The standard devi¬ 
ation of the noise is set to ~ 10% of the maximum sampled value x max (in the simulations 
that we ran x max ~ 1). While the coefficients and a 2 remain practically unchanged, /i 
changes to ~ 1.83. The sparsity of the precision matrix is ~ 76%. The respective cross 
validation measures are given by ME= 0.012, MAE= 0.047, RMSE= 0.061, and r— 0.93. 

We also used the MATLAB global optimizer GlobalSearch with the same initial parameter 
vector as above to determine the SLI model parameters. GlobalSearch uses a scatter- 
search algorithm to generate starting points (initial parameter guesses). The minimization is 
conducted using fmincon to determine the local minimum close to the current starting point. 
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SLI Map Quad 



FIG. 3: SLI predictions at 240 validation points of time series with Matern correlations 
(a = 10, v = 3.5, £ = 10). The 60 training points are marked by stars (bine online), 
validation points are marked by dots (red online), and SLI predictions are marked by the 
continuous line (black online). Error bars are based on three times the conditional 

standard deviation (green online). 


We use the lower and upper bounds defined in Section HID to constrain the space of the 
starting points. GlobalSearch investigates a set of 66 starting points, and convergence to a 
local minimum is achieved for all of them. The globally optimal SLI parameters are estimated 
as cti ~ 1.50, «2 ~ 224.62, /i « 2.01, A ~ 0.748. The respective validation measures are given 
by ME= 0.0055, MAE= 0.045, RMSE= 0.063, and r = 0.93. These measures do not differ 


17 















WUMa -I-I- — 

,■ ■ 


100 ■ 

,i ■• < 1 . - ’ ..■Jt 

200 & 



!|i I ; itllf : I iiSilllSlii I® till! ill 

•■■;> r llfl 

■ 

is v>Ut V>' #%'* A 1 ' r > :*'" r - i : * '■ • G V! 


900 • 

■I __ 


vBamam 


200 


400 600 800 


(a) In | Jj >: 



1000 


0 0.2 0.4 0.6 0.8 

Predictions 

(b) Predictions vs validation values 


FIG. 4: Analysis of the truncated exponential function (24). (a) Logarithm of the 
absolute value of the precision matrix; dark areas (blue online) correspond to low values 
whereas lighter areas (green to red online) correspond to higher values, (b) Scatter plot of 
SLI predictions versus the respective values of the validation set. 


significantly from those obtained with the locally optimum solution. The MAE value is 
lower for the global optimum, which is expected since MAE reflects the value of the cost 


function (14). On the other hand, the RMSE obtained with the global optimum is slightly 


higher than that of the local optimum. This result indicates that quite different parameter 
vectors can lead to similar cross validation results. This behavior has also been observed 
with covariance models whose parameter vector involves more than the variance and the 
correlation length 


C. Radioactivity data in two dimensions 

This example focuses on daily means of radioactivity gamma dose rates over part of 
the Federal Republic of Germany. The data were provided by the German automatic ra¬ 
dioactivity monitoring network for the Spatial Interpolation Comparison Exercise 2004 (SIC 
2004) [12]. This data set is well studied and thus allows easy comparisons with other meth¬ 
ods [13j. The 1008 stations are partitioned into a training set of 200 randomly selected 
locations and a validation set of 808 locations where predictions are compared with the 
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observations. Two different scenarios are investigated: A normal data set corresponding to 
typical background radioactivity measurements, and an emergency data set, in which a local 
release of radioactivity in the southwest corner of the monitored area was simulated using a 
dispersion process to obtain a few values with magnitudes around 10 times higher than the 
background. The rates are measured in nanoSievert per hour (nSv/h). The normal training 
set follows the Gaussian distribution with the minimum around 58 nSv/h and the maximum 
around 153 nSv/h. In the emergency training set there are two values > 1000 nSv/h, with 
the maximum at 1499 nSv/h. We compare the prediction performance of the SLI model 
against the 808 values of the validation set. 


1. Normal data 


For normal data, the optimal SLI parameters based on the training set with a quadratic 
kernel and k = 2 are given by aq « 143,02 ~ 47.56, p ~ 2.64, A ~ 3.24 x 10 3 . Figure 5a 


illustrates the relative values of the bandwidths used. Higher values correspond to more 
isolated points in areas of low sampling density and along the boundaries of the convex 


hull of the domain. Figure 5b presents the natural logarithm of the absolute value of the 
precision matrix. Overall, about 32% (i.e., 12 718) of the total number of pairs yield nonzero 
precision values, implying that the sparsity of the precision matrix is ~ 68%. 

The cross-validation results are tabulated in Table m The cross validation measures 
(based on the validation set) obtained in a recent study by means of Ordinary Kriging are: 
ME= —1.36, MAE= 9.29, RMSE= 12.59, r— 0.78 [32] • These values are in close agreement 
with the SLI results in Table UT1 

Various geostatistical and machine learning methods have been applied to the SIC 2004 
data (neural networks, geostatistics, and splines). Excluding the results of some poor 
performers, the cross validation measures obtained are in the following ranges [32]: ME 
G [-1.39,-0.04] and G [0.20,1.60], MAE G [9.05,12.10], RMSE G [12.43,15.90], and r 
G [0.64, 0.79]. Hence, the SLI cross validation results are close to the best performers. Fig. [6] 
presents a map of the radioactivity pattern generated by SLI and contrasts it with the map 
generated by bilinear interpolation. The SLI spatial pattern is smoother and thus appears 
more realistic. Its smoothing effect near the sample values, however, is more pronounced 
than that caused by bilinear interpolation. 
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FIG. 5: Analysis of SIC 2004 normal data set using normalized coordinates; the longest 
side is set to 100 and the aspect ratio is maintained, (a) Bubble plot of the relative size of 
local bandwidths: larger circles correspond to bigger bandwidths. The continuous line 
along the domain boundary marks the convex hull of the sampling set. (b) Logarithm of 
the absolute values of the precision matrix elements. Darker areas (blue online) correspond 
to lower values, whereas lighter areas (red online) correspond to higher values. 


TABLE II: Cross validation performance measures for SIC 2004 normal data. The second 
row presents the performance of the SLI predictor at the 808 validation set points. ME: 
Mean error (bias); MAE: Mean absolute error; MARE: Mean absolute relative error; 
RMSE: Root mean square error; r: Pearson correlation coefficient. 

SLI ME MAE MARE RMSE r 

Validation set —1.30 9.30 0.09 12.62 0.78 


We also conduct a stability analysis by removing one sampling point at a time and deter¬ 
mining the optimal SLI model using leave-one-out cross validation with that point removed. 
The variation of the SLI parameters is shown in Fig.[7J an, aq and p are quite stable, whereas 
A shows more variability. The spikes in the plots of Fig. [7] are exaggerated by using a nar¬ 
row vertical range to better illustrate the parameter variability. For aq, aq the maximum 
relative variation (with respect to the mean) ranges from a fraction of a thousandth (for oq) 
to few thousandths (for a 2 ); M shows stronger variations, whereas the strongest variation 
is exhibited by A, since the latter is a scaling factor that determines the overall energy of 
the ensemble of points and compensates for variations in the other parameters. We believe 
that the parameter variations exhibited in Fig. [7] are, at least partially due to extremely 
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FIG. 6: Map of background radioactivity rates in Germany (based on 200 training data) 
using a mapping grid with 100 nodes per side, (a) Bilinear interpolation using the griddata 
command of MATLAB. (b) SLI interpolation map using the optimal parameters reported 

in the text. 


slow variation of the cost function over a region of the parameter space, a condition also 
observed in maximum likelihood estimation of spatial models with Matern covariance [T6]. 
This slow variation implies quasi-degeneracy of the parameter vector; the quasi-degeneracy 
implies that vectors which are very far in parameter space may lead to very similar cost 
function values. More recently, the difficulties involved in nonlinear fits of multi-parametric 
models to data have been investigated in E2- 


2. Emergency data 


For the SIC 2004 emergency data, cross validation results with different kernel functions 


(tricubic, exponential, and quadratic) are shown in Table III The SLI parameters are ini¬ 
tialized using the optimal values for the normal data. The last row of the table is based on 
estimation with a quadratic kernel function after removing the three highest values. The 
best results in Table EHI are obtained with the quadratic kernel including all the data. The 
optimal SLI parameters are aq ~ 143, aq ~ 47.56, /i ~ 2.69, A ~ 4.32 x 10 5 . The parameters, 
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FIG. 7: Variation of SLI parameters estimated by removing one value at a time from the 
200 training locations of the SIC 2004 normal data. 

except for A, are close to their normal case counterparts. The difference in A is due to the 
much higher variance of the emergency data set. 

The variation of the SLI parameters in leave-one-out cross validation exhibits similar 
patterns as for the normal data, except that more pronounced variations of A are observed 
when the extreme values are removed. The precision matrix has 23 232 non-zero elements, 
implying a sparsity of ~ 42%, in contrast with 12 718 (sparsity ~ 32%) in the normal case. 
This difference clearly illustrates the dependence of the precision matrix on the sample 
values in addition to the sampling pattern. SLI does not rely on estimating the variogram 
function, and thus it is not hindered by the presence of extreme values. On the other hand, 
geostatistical methods rely on the variogram function, which may not be reliably estimated in 
such cases [19]. The Pearson correlation coefficient is significantly lower than in the normal 
set due to underestimation of the extreme values, while the Spearman rank correlation 
coefficient is comparable to the normal case. The cross validation measures obtained in SIC 


22 
















































2004 are in the following intervals [12]: ME G [-11.10,-0.12] and G [0.41,19.71], MAE 
G [14.85,146.36], RMSE G [45.46,212.10], and r G [0.02,0.86]. The best performance in 
terms of both MAE and RMSE was obtained by means of a Generalized Regression Neural 
Network [36]. Looking at the scatter plot of MAE versus RMSE —Fig. 6 in [IT] — the SLI 
performance is closer to the geostatistical and spline methods. 

TABLE III: SLI cross validation performance measures for SIC 2004 emergency data. The 
second row presents the performance of the SLI predictor at the 808 validation set points. 
The first five cross validation measures are as described in the caption of Table [TT] and r$ 

is the Spearman correlation coefficient. 


SLI Kernel function 

ME 

MAE : 

MARE 

RMSE 

r 

rs 

Tricubic: (1 — w 3 ) 3 

5.78 

24.22 

0.20 

81.33 

0.25 

0.57 

Exponential: e~ u 

6.06 

23.84 

0.19 

79.78 

0.34 

0.63 

Quadratic 

3.04 

23.16 

0.17 

75.63 

0.43 

0.77 

Quadratic (outliers removed) 

-8.28 

16.46 

0.10 

81.41 

0.27 

0.77 


V. DISCUSSION 

A. Connections with machine learning 

The SLI model is similar to fc-Nearest Neighbors (KNN), since both methods employ 
an optimal neighborhood range. In the case of KNN a uniform optimal number of nearest 
neighbors is determined, and the estimate at an unmeasured point is simply the mean of 
its k nearest neighbors. In SLI, a locally optimal neighborhood size is determined implying 
that the number of neighbors used in prediction varies locally. In addition, the estimate is 
a weighted mean of the neighbor values, in which the weights are determined by the kernel 
function and the bandwidths. In this respect, SLI is similar to the Nadaraya-Watson kernel 
regression method mm and to the Support Vector Machine algorithm [39]. SLI can also 
be viewed as a particular type of Gaussian process with a sparse inverse covariance kernel, 
which could be used as an alternative to the sparse Gaussian process framework to improve 
the computational efficiency of predictions [9]. 

In this study we formulated the SLI model using the spatial locations Sn as inputs and 
the respective values of the scalar held values as outputs. This framework is appropriate for 
scattered spatial data. It is possible, however, to use more general input variables instead 
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of the spatial locations, so long as a suitable measure of distance can be defined. 


B. Notes on implementation 

We presented a “plain vanilla” version of the SLI model. Modifications that can increase 
the flexibility but also the complexity of the model are possible. The local kernel band- 
widths are determined by fixing the neighbor order k and using a uniform scaling parameter 
/i. Alternatively, one can consider estimating k from the data and using a locally varying 
/i. With respect to the latter, potential gains should be weighted against the loss of com¬ 
putational efficiency that will result from the significant increase of the parameter vector 
size. While our estimate of px is based on the sample mean, it is possible to estimate px 
by means of the leave-one-out cross validation procedure. It is also possible to replace px 
with a space-dependent trend function. 

The present version of the SLI model does not involve anisotropy. Nevertheless, anisotropy 
is important in cases such as the radioactivity emergency data [33]: the best performing 
method in SIC 2004 for this set was a general regression neural network with an anisotropic 
Gaussian kernel function. Similarly, in SLI it is possible to use weighted Euclidean dis¬ 
tances or Minkowski metrics instead of the classical Euclidean distance [4]. SLI can also 
be extended to spherical surfaces, a case which is relevant for global geospatial data. In 
addition, the SLI model can capture correlations in higher-dimensional, abstract feature 
spaces equipped with a suitable distance. 

At this point there is no rigorous physical interpretation of the coefficients a i, a 2 and 
A. In general, higher values of aq ( 0 : 2 ) imply higher cost for gradient (curvature), whereas 
A controls the overall “energy”. I 11 the continuum case (i.e., for Spartan random fields) 
coefficients aq and 0:2 are related to a rigidity coefficient and a characteristic length [2D El]. 
A similar correspondence can also be established for data distributed on rectangular grids. 
In contrast, such relations are not available for scattered data. Even in the continuum and 
grid cases, however, statistical measures such as the variance and the correlation length 
have a nonlinear dependence on the SSRF model parameters [2D El]. A reasonable initial 
value for /1 is around 2-3, to allow even compactly supported kernel functions to build local 
neighborhoods containing at least a few data points. For aq and « 2 , we have used positive 
values between the arbitrary bounds of 0.5 and 300. Exploratory runs with different initial 
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conditions can help to locate a reasonable starting point. Alternatively, a global optimization 


approach can be used as in Section IV B 


We have opted for a cross-validation cost functional which is based on the mean absolute 
error. It is possible to use different cost functionals that involve a linear combination of 
validation measures such as the mean absolute error and the root mean square error. Most 
results for the case studies investigated above were obtained using an interior-point opti¬ 
mization method that searches for local minima of the cost function. In all of the cases that 
we have investigated (including data not presented herein), the local optimization led to 
reasonable cross validation measures which were comparable to those obtained with other 
methods. As we have shown in the case of 4 D synthetic data, searching for global op¬ 
tima does not necessarily lead to significant performance improvement. The investigation of 
global optimization methods with different data sets, however, deserves further attention. 


VI. CONCLUSIONS 


The SLI model presented above provides a bridge between geostatistics and machine 
learning. It is based on an exponential joint density which involves an energy functional with 
an explicit precision (inverse covariance) matrix. The latter is constructed by superimposing 
network sub-matrices that implement local interactions between neighboring field values in 
terms of kernel functions. The algorithmic complexity of SLI missing value estimation scales 
linearly with the sample size except for a global 0(N 2 ) term which is, however, computed 
once for all the prediction points. Hence, the leave-one-out cross-validation approach can 
be used to efficiently infer the SLI model parameters. 


For missing data on rectangular grids (ongoing research) the computational complexity 
of the SLI method can be simplified to linear scaling with N , because <Si and S -2 can be 
calculated without kernel functions 53 0H]. In addition, calculating and storing the large 
N x N distance matrix is not necessary in this case. In conclusion, the SLI model is a 
promising tool for the analysis of big spatial data. In future research we will investigate 
the extension of the model to space-time data. Finally, the Matlab code used for the 


case studies in Section [TV] is available at the web address of the Geostatistics laboratory: 

http://www.geostatistics.tuc.gr/4940.htral. 
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Appendix A: Minimization of NLL 


For the pdf given by ([Tj) , the log-likelihood is given by 


LL(x s ; 6) = In L(x s ;0) = -if x (x s ; 0) - hr Z{0). 


(Al) 


The partition function in (Al) is given by the multiple integral 


N poo 

z{o) = n / dx i ex p (~ h 

i=i 


x(x s ; 0)) • 


(A2) 


The square gradient and square curvature terms do not depend on /x x because they involve 
differences Xi — Xj. Hence, we can express Q as follows 


tj / a \ It -f/n\ i Rx 2p,xRx 

F7 x (x s ; 6) = -xg J(0) x s + — - - —, 


(A3) 


where /i x is the sample mean. Maximizing the NLL with respect to /ix, using (A3) for the 
energy functional, yields 

Rx = Ex- 


Since this fixes the parameter /ix, we can use expression ([8]) for the energy functional. 
We apply the scaling transformation Hx(x s; 0) = H(x. s; 0_a)/A, where 6 \ is the parameter 
vector except for A and H(x s; 0_a) is A-independent. The transformation H(-) i—> H(-) is 
equivalent to x % i—^ y t — (Xi — ~jJ x ) /\/A. Let us then define the scaled partition function 
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Z(0_ x ) by means of 


z(fi- a) = 


N 

n 


i =i 

N 


dy.i exp ( —H(y, 6 


/ <OO 

dxi exp (—i/(x s ;0)) 

-OO 

=A“ 7V/2 Z(0). 


(A4) 


In light of the above transformations, the dependence of NLL on A takes the following 
explicit form 

NLL(x s ;0) = ^ In A + In Z(0-\). 

Hence, by minimizing NLL with respect to A, i.e., dNL ^ s,e> = 0, we obtain the following 
expression for the optimal A: 


A* = 


2 

N 


(AS) 


From the Gaussian joint pdf (|8|) it follows that 


Z(0-x) = (2n) N ' 2 {det 


- 1/2 


where J{6-\) = A J{0). We insert the optimal value A* in NLL and use the expression 
above for the log-partition function which leads to 


NLL(x s ; 0_ x ) lid 2g(x ^ 9 - A> ) + j ln(2ir) - \ det J(0_ x , 


(A6) 


The NLL (A6) is minimized numerically using the MATLAB constrained minimization func¬ 


tion fmincon. Constraints are used to ensure that the parameter values are positive. The 
log-determinant is calculated numerically using the singular value decomposition of the pre¬ 
cision matrix. This is a procedure with numerical complexity 0(N 3 ) for a full rank matrix. 
For this reason, we use cross validation instead of maximum likelihood for parameter infer¬ 
ence. 
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